ABSTRACT

With the ever-increasing quantity and variety of data worldwide,
the Web has become a rich repository of mathematical formulae.
This necessitates the creation of robust and scalable systems
for Mathematical Information Retrieval, where users search for
mathematical information using individual formulae
(query-by-expression) or a combination of keywords and formulae.
Often, the pages that best satisfy users’ information needs
contain expressions that only approximately match the query
formulae. For users who are trying to locate or re-find a
specific expression, who are browsing for similar formulae, or
who are mathematical non-experts, the similarity of formulae
depends more on the relative positions of symbols than on deep
mathematical semantics.

We propose the Maximum Subtree Similarity (MSS) metric for
query-by-expression that produces intuitive rankings of formulae
based on their appearance, as represented by the types and
relative positions of symbols. Because it is too expensive to
apply the metric against all formulae in large collections, we
first retrieve expressions using an inverted index over tuples
that encode relationships between pairs of symbols, ranking hits
using the Dice coefficient. The top-k formulae are then re-ranked
using MSS. Our approach obtains state-of-the-art performance on
the NTCIR-11 Wikipedia formula retrieval benchmark and is
efficient in terms of both index space and overall retrieval
time. Retrieval systems for other graphical forms, including
chemical diagrams, flowcharts, figures, and tables, may also
benefit from adopting our approach.


Categories and Subject Descriptors 

H.2.4 [Database Management]: Systems - Query Processing;
H.3.3 [Information Search and Retrieval]: Retrieval models;
H.3.4 [Systems and Software]: Performance evaluation
(efficiency and effectiveness)
1. INTRODUCTION

Mathematical Information Retrieval (MIR) is an important
emerging area of Information Retrieval research [16, 24].
Technical documents often include a substantial amount of
mathematics, but math is difficult to use directly in queries.
For the most part, large-scale search engines do not support
formula search other than indirectly, e.g., through matching
LaTeX strings. Formula queries allow documents with similar
expressions or mathematical models to be discovered
automatically, providing a new way to search and browse
technical literature [23]. For mathematical non-experts,
querying based on the appearance of expressions may also be
useful, for example when students try to interpret unfamiliar
notation [21]. Many have had the experience of wishing they
could search through technical documents for similar formulae
rather than find words to describe them.


Figure 1 shows the top of a results page from the new Tangent
[18] formula retrieval engine.¹ The 17 hits shown are grouped by
their structure (exact match, variable substitution, operator
substitution), and groups are ordered by the similarity of the
contained formulae to the query. Efficient and effective
retrieval becomes more difficult when the best matches are even
less similar to the query formula (e.g., the repository includes
larger expressions that include pieces similar to one or more
parts of the query formula) or when wildcards that can match
arbitrary symbols or subexpressions are included in the query
[§].

For scalability, Tangent now employs a two-level cascading
search system [20] that provides both query runtime efficiency
and ranking effectiveness for formula search. The first level is
the core engine, which uses an uncompressed inverted index over
tuples representing pairs of symbols in an expression. This
level provides limited support for wildcard symbols and can
quickly produce an ordered list of candidate results using a
simple ranking algorithm. The second level re-ranks the top
candidate results using Maximum Subtree Similarity (MSS), a new
metric for computing the similarity of mathematical formulae
based on their appearance. The system architecture is summarized
in Figure 2.

Contributions. This paper includes three primary contributions.
Our first is the incorporation of substantially smaller indices
than those used previously [14, 18],

1 http://www.cs.rit.edu/~dprl/Software.html
Figure 1: Tangent Search Results Page (truncated).


which can obtain strong retrieval results in a scalable system.
The second contribution is the MSS metric (Section 7), which
produces an intuitive ordering for retrieved formulae based on
the visual structure of expressions, taking unifiable symbol
types into account. The third is a new symbol pair retrieval
model that incorporates the first two contributions in an
efficient and effective two-stage cascaded implementation, as
demonstrated experimentally (Section 8). In addition, we believe
that the form of output adopted, namely grouping results by
similarity and match structure, is an improvement over existing
MIR interfaces.

2. RELATED WORK 

Interest in Mathematical Information Retrieval (MIR) has been
increasing in recent years, as witnessed by the NTCIR-10 [1] and
NTCIR-11 [2] Math Retrieval Tasks held in 2013 and 2014,
respectively.

Math representations are naturally hierarchical, and often
represented by trees that may be encoded as text strings. As a
result, approaches to query-by-expression may be categorized as
tree-based or text-based, as determined by the structures used
to represent formulae. The encoded hierarchies commonly
represent either the arrangement of symbols on writing lines (as
in LaTeX or Presentation MathML) or the underlying mathematical
semantics as nested applications of operations to arguments (as
in OpenMath or Content MathML). Both appearance and semantic
representations have been used for retrieval.

Text-Based Approaches. In text-based approaches, math expression
trees are linearized, and often normalized, before indexing and
retrieval. Common normalizations include defining synonyms for
symbols (e.g., function names), using canonical orderings for
commutative operators and spatial relationships (e.g., to group
a+b with b+a and x_i^2 with x^2_i), enumerating variables, and
replacing symbols by their mathematical type (e.g., numbers,
variables, and classes of operators) [17, 24].

Although linearization masks significant amounts of structural
information, it allows text and math retrieval to be carried out
efficiently by a single search engine (commonly Lucene²). As a
result, most text-based formula retrieval methods use TF-IDF
(term frequency-inverse document frequency) retrieval after
linearizing expressions [12, 17]. In an alternative approach,
the largest common substring between the query formula and each
indexed expression is used to retrieve LaTeX strings [15]. This
captures more structural information, but also requires
evaluating all expressions in the index using a quadratic
algorithm.

Tree-Based Approaches. Tree-based formula retrieval approaches
use explicit trees to represent expression appearance or
semantics directly. These approaches index complete formula
trees, often along with their subexpressions to support partial
matching. Methods have been developed to compress tree indices
by storing identical subtrees uniquely [8] and to match
expressions using tree-edit distances with early stopping for
fast retrieval [9]. The substitution tree data structure, first
designed for unification of predicates [5], has been used to
create tree-structured indices for formulae [10]. Descendants of
an index tree node contain expressions that unify with the
parameterized expression stored at that node.

A recent tree-based technique adapts TF-IDF retrieval for
vectors of subexpressions and generalized subexpressions in
which arguments are represented by untyped placeholders [11]. In
this method a Symbol Layout Tree is modified to capture some
semantic properties, normalizing the order of arguments for
commutative operators and representing operator precedences
explicitly.

‘Spectral’ Tree-Based Approaches. An emerging subclass of the
tree-based approach uses paths or small subtrees rather than
complete subtrees for retrieval. One system converts
sub-expressions in operator trees to words representing
individual arguments and operator-argument triples [13]. A
lattice over the sets of generated words is used to define
similarity, and a breadth-first search constructs a neighbor
graph traversed during retrieval. Another system employs an
inverted index over paths in operator trees from the root to
each operator and operand, using exact matching of paths for
retrieval [7]. The large number of possible unique paths,
combined with exact matching, makes this technique brittle.

Rather than indexing paths from the root of the tree, the
Tangent math retrieval system stores relative positions of
symbol pairs in Symbol Layout Trees to create a “bag of symbol
pairs” representation [14, 18]. This symbol pair representation
supports partial matches in a flexible way, while preserving
enough structural information to return exact matches for
queries. Set agreement metrics are applied to the bags of symbol
pairs to compute formula similarities. For example, the harmonic
mean for the percentage of matched pairs in the query and a
candidate (i.e., Dice’s coefficient for set similarity³) prefers
large matches of the query with few additional symbols in the
candidate. Tangent (starting with Version 2) also accommodates
matrices, isolated symbols, and wildcard symbols and augments
formula search with text search. Formula retrieval based on bags
of symbol pairs combined with keyword retrieval using Lucene
allowed

2 https://lucene.apache.org/

3 Given a query tree $T_q$ and a candidate tree $T_c$, let $F_q$
and $F_c$, respectively, denote a set of their features (such as
a set of node and edge labels) and let $F_{q,c} = F_q \cap F_c$
denote the set of features they have in common. Dice’s
coefficient of similarity, $\frac{2|F_{q,c}|}{|F_q| + |F_c|}$,
can then serve as the score for $T_c$.









Figure 2: Formula Retrieval in Tangent (version 3) 


Tangent to produce the highest Precision@5 result for the
NTCIR-11 Math-2 main retrieval task with combined text and
formula queries (92%) [2].

In this paper, we address needed improvements for Tangent,
described in the next section.

3. PROBLEM STATEMENT 

The math retrieval task we address is to search a corpus to
produce a ranked list of formulae (and the pages on which those
formulae are located) that match a query formula expressed in
LaTeX or Presentation MathML, with or without the inclusion of
wildcard symbols. Formulae ranked highly should match the query
formula exactly or, failing that, closely resemble it. The
system should be scalable in terms of index size, indexing
speed, and querying speed.

Scalability and Retrieval Effectiveness. As originally
implemented, Tangent is not scalable: indexing proceeds at fewer
than 200 formulae per second, producing indices of over 1 GB for
the NTCIR-11 Wikipedia corpus and 30 GB for the NTCIR-11 arXiv
corpus. Retrieval is also slow, averaging 5 seconds per query
for the Wikipedia task (under 400 thousand distinct formulae)
and 3 minutes per query for the NTCIR main task (3 million
distinct formulae). Furthermore, while retrieval effectiveness
is very good, there is substantial room for improvement.

4. FORMULA STRUCTURE MODEL 
4.1 Symbol Layout Tree (SLT) 

Symbols and Containers. Tangent uses a Symbol Layout Tree (SLT)
to represent the appearance of a mathematical formula. Tree
nodes represent individual symbols and visually explicit
aggregates, such as fractions, matrices, function arguments, and
parenthesized expressions. In Tangent Version 3, all symbols
except those representing operators or separators (e.g., commas)
are prefixed with their type, represented by a single character
followed by an exclamation point. More specifically, SLT nodes
represent:

• typed mathematical symbols: numbers (N!n); variable 
names (V!v); text fragments, such as lim, otherwise, 
and such that (T!f) 

• fractions (F!) 

• container objects: radicals (R!); matrices, tabular
structures, and parenthesized expressions (M!/rxc)

• explicitly specified whitespace (W!) 

• wildcard symbols (?w) 

• mathematical operators 

Because of their visual similarity, all tabular structures,
including matrices, binomial coefficients, and piecewise defined
functions, are encoded using the matrix indicator M!. If a
matrix-like structure is surrounded by fence characters, then
those symbols are indicated after the exclamation mark. Finally,
the indicator includes a pair of numbers separated by an x,
indicating the number of rows and the number of columns in the
structure. For example, M!2x3 represents a 2x3 table with no
surrounding delimiters and M!()1x5 represents a 1x5 table
surrounded by parentheses. Importantly, all parenthesized
subexpressions are treated as if they were 1x1 matrices
surrounded by parentheses, and, in particular, the arguments for
any n-ary function are represented as if they were a 1xn matrix
surrounded by parentheses.

In addition to its label (e.g., V!x), every SLT node has an
associated type (number, variable, operator, etc.). A node’s
type is reflected in its label, usually represented by the part
of the label up to an exclamation point (e.g., V!), but node
labels preceded by a question mark (?) have type wildcard; a
matrix node’s type includes the matrix dimensions, but not its
fence characters (e.g., M!2x3); and other node labels without
exclamation marks have type operator.
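The labeling rules above can be illustrated with a minimal Python sketch. The function name `node_type`, the regular expression, and the string "wildcard" and "operator" return values are assumptions made for illustration; they are not taken from the Tangent implementation.

```python
import re

# Hypothetical pattern for matrix labels: "M!" + optional fence
# characters + "<rows>x<cols>", e.g. "M!2x3" or "M!()1x5".
_MATRIX = re.compile(r"^M!(?P<fence>\D*?)(?P<rows>\d+)x(?P<cols>\d+)$")


def node_type(label: str) -> str:
    """Derive a node's type from its SLT label (illustrative only)."""
    if label.startswith("?"):
        return "wildcard"
    matrix = _MATRIX.match(label)
    if matrix:
        # The type keeps the dimensions but drops the fence characters.
        return f"M!{matrix.group('rows')}x{matrix.group('cols')}"
    bang = label.find("!")
    if bang > 0:
        # Typed symbol: the type is the prefix up to the exclamation point.
        return label[: bang + 1]
    # Labels without a type prefix are treated as operators.
    return "operator"
```

Under these assumptions, `node_type("M!()1x5")` yields the same type as an unfenced 1x5 table, which mirrors the statement that fence characters are not part of a matrix node's type.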

Spatial Relationships. Labeled edges in the SLT cap¬ 
ture the spatial relationships between objects represented by 
the nodes: 

1. next (→) references the adjacent object that appears
to the right on the same line

2. within ( [V] ) references the radicand of a root or to 
the first element appearing in row-major order in a 
structure represented by M! 

3. element ( —o ) references the next element appearing 
in row-major order in a structure represented by M! 

4. above (↑) references the leftmost object on a higher
line (e.g., superscript, over symbol, fraction numerator,
or index for a radical)

5. below (↓) references the leftmost object on a lower
line (e.g., subscript, under symbol, fraction denominator)

6. pre-above (⇑) references the leftmost object of a
prescripted superscript

7. pre-below (⇓) references the leftmost object of a
prescripted subscript

An SLT is rooted at the leftmost object on the main baseline
(writing line) of the formula it represents. Figure 3 shows an
example of an SLT, where for simplicity, unlabeled edges
represent the next relationship and types other than wildcard
are not displayed.

Creating SLTs. SLTs can be created straightforwardly from
Presentation MathML by a recursive descent parser. For other
input formats, we assume that converters such as LaTeXML⁴ exist
to produce Presentation MathML.

In most circumstances, whitespace is not represented in an
SLT. As a result, although Unicode whitespace and related
characters, such as “invisible times” (U+2062), occasionally
appear as operators in Presentation MathML expressions,


4 http://dlmf.nist.gov/LaTeXML/






































(a) Query Formula and Symbol Layout Tree (SLT) 
(b) Tuples for SLT Symbol Relationships with Counts






(c) Example Search Hits. Green: exact matches, Orange: unified
matches, Dashed Nodes: unmatched symbols


Figure 3: Query Formula with Corresponding Symbol Layout Tree
(SLT), Symbol Pair Tuples, and Sample Search Results.


the number of edges separating them) is less than or equal 
to a specified window size w. 

In addition to normal tuples for an SLT, end-of-line (EOL)
information can optionally be captured by introducing special
tuples of the form (last symbol, !0, →). Such end-of-line tuples
are likely to improve retrieval, particularly for individual
symbols and small expressions.

Tuple information for the example expression in Figure 3 along
with tuple counts is shown in Figure 3(b). The maximum path
length between two symbols is five, yielding 23 distinct tuples
(19 symbol pair tuples plus four EOL tuples with one repetition
for ‘i’). However, if the window size w is set to 2, then only
16 distinct symbol pairs are stored.
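The effect of the window size can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Tangent's code: the `Node` class, the single-character edge codes ("n" for next, "a" for above), and the example formula x^2 + y are all hypothetical, and end-of-line tuples are omitted.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    # Maps an edge label (hypothetical one-character code) to a child Node.
    children: dict = field(default_factory=dict)


def symbol_pair_tuples(root, window=None):
    """Count (ancestor, descendant, relative path) tuples in an SLT.

    Only pairs separated by at most `window` edges are kept;
    window=None keeps every ancestor/descendant pair.
    """
    counts = Counter()

    def visit(node, ancestors):
        # ancestors holds (ancestor label, edge path from ancestor to node).
        for anc_label, path in ancestors:
            counts[(anc_label, node.label, path)] += 1
        for edge, child in node.children.items():
            extended = [(lab, path + edge) for lab, path in ancestors]
            extended.append((node.label, edge))
            if window is not None:
                extended = [(lab, p) for lab, p in extended if len(p) <= window]
            visit(child, extended)

    visit(root, [])
    return counts


# Hypothetical SLT for x^2 + y: x -above-> 2, x -next-> + -next-> y.
y = Node("V!y")
plus = Node("+", {"n": y})
x = Node("V!x", {"a": Node("N!2"), "n": plus})

all_pairs = symbol_pair_tuples(x)
near_pairs = symbol_pair_tuples(x, window=1)
```

With no window, the pair (V!x, V!y) at distance two is stored; with `window=1` it is dropped, which is the space saving the window size is meant to provide.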

5. ARCHITECTURAL OVERVIEW 

In order to improve both query runtime efficiency and ranking
effectiveness, our search system uses a two-level cascading
approach [20]. After parsing a query formula, the first level
searches the corpus and returns candidate results. The second
level then re-ranks those candidates, and finally the results
are displayed on an HTML page using grouping and color coding of
match structures for improved clarity.

The first level of our search system, referred to as the core
engine, uses an inverted index over tuples defined from symbol
pairs of the Symbol Layout Trees of expressions. The core engine
supports limited query functionality in order to produce a small
list of candidate expression results quickly using the simple
ranking metric defined by Dice’s coefficient over tuple matches.
These candidate expressions are returned together with the lists
of documents that contain each expression.

The second level of our system is a re-ranker that implements
the full query functionality to identify tuple matches in
expressions. The re-ranker scores the candidate expression
results using the more accurate Maximum Subtree Similarity
ranking metric, as defined in Section 7. The re-ranker then
combines several expression scores in the candidate expression
results to produce a final document ranking.

The final query results can be ordered by either formula rank
or document rank. When the results are presented in formula rank
order, they are grouped by their maximum query subtree overlaps,
as shown in Figure 1; when presented in document rank order,
they are grouped by document. In
either case, expressions are displayed color coded to high¬ 
light their maximum subtree overlaps, as shown in Figure]^. 


they are all ignored for the purpose of matching expressions 
in Tangent. 

4.2 SLT Tuple Representation 

As described above, a node in a Symbol Layout Tree can have up
to seven labeled outgoing edges (with no edge label repeating
for any node). For a given Symbol Layout Tree, Tangent produces
a set of tuples that each encodes the relationship between a
pair of symbols occurring on some path from the root to a leaf.
Given two nodes on such a path, we define the relative path
between the nodes by the sequence of edge labels traversed from
the ancestor node to the descendant.

As an optimization that saves both space and time, and 
following the practice of searching via n-grams [22], the new 
version of Tangent does not store all tuples defined, but only 
those for which the distance between symbols (measured by 


6. CORE ENGINE 

The core engine for our system quickly finds a small number of
highly relevant candidate results for a math search query, which
are later re-ranked. The engine returns these top-k formulae
determined using a simple ranking algorithm, along with the list
of documents containing each formula and the first position of
that formula in the document.

Since runtime performance is a high priority, the core engine
uses a customized inverted index data structure implemented in
C++. In addition, the engine evaluates only a subset of the
query language functionality to allow the use of a fast and
simple ranking algorithm that can still find a good set of
candidate results.

The input to the indexer is a set of document names and the
extracted mathematical formulae found in each document,
{document, formula+}+, and the input to the search











component is a single query formula. Each formula is converted
to a set of tuples (see Section 4.2) that serve as words do in a
normal search engine.


Index Structures: At index time, an inverted index is built
over the given document-formula-tuple relationships. The index
includes postings lists PL1 that map each tuple to all formulae
containing that tuple. A query containing only non-wildcard
tuples can thus be implemented by combining the corresponding
tuples’ postings lists using an OR operator. We store these
postings lists as ordered lists of formula identifiers
(integers), so that the lists can be easily combined using a
merge algorithm. The engine uses a dictionary D1 to always
assign the same identifier to the same formula, thus saving both
space and time in the engine.

In order to return document information for query results, 
the engine stores postings lists PL2 mapping each formula 
identifier to the identifiers of the documents containing those 
formulae, along with their first position in the document. To 
improve compression, a dictionary D2 is used for document 
names and another dictionary D3 is used for tuples. 

The core engine supports limited wildcard functionality. Query
tuples containing a single wildcard as either the ancestor or
descendant symbol are implemented as iterator expansions. The
engine stores postings lists PL3 that map each single wildcard
tuple to the set of tuple identifiers that match. Assigning
tuple identifiers using a dictionary D4 again gives some
compression benefits. Implementing even this restricted wildcard
functionality can be expensive, since the iterator expansion
could be quite large.⁵

In summary, the core engine uses two main data structures:
dictionaries convert objects (such as strings or tuples) into a
compact 0-based range of internal identifiers (integers) and
postings lists are lists of integer tuples ordered by the first
integer in the tuple.

• dictionaries

  D1: formula → formID

  D2: document → docID

  D3: tuple → tupleID

  D4: wildcardtuple → wildcardtupleID

• postings lists

  PL1: tupleID → (formID, count)+

  PL2: formID → (docID, position)+

  PL3: wildcardtupleID → (tupleID)+

These data structures can be combined to produce compression,
ease of storage, and fast access speeds.
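To make the roles of the dictionaries and postings lists concrete, the following Python sketch builds D1 to D3, PL1, and PL2 in memory. It is only an illustration under simplifying assumptions: the class and method names are hypothetical, wildcard structures (D4, PL3) are left out, formulae are keyed by an arbitrary string, and the actual engine is a customized C++ implementation whose details are not reproduced here.

```python
from collections import Counter, defaultdict


class Dictionary:
    """Maps hashable objects to a compact 0-based range of integer ids."""

    def __init__(self):
        self.ids = {}

    def id_of(self, obj):
        # Returns the existing id, or assigns the next unused integer.
        return self.ids.setdefault(obj, len(self.ids))


class Index:
    def __init__(self):
        self.d1 = Dictionary()        # formula  -> formID
        self.d2 = Dictionary()        # document -> docID
        self.d3 = Dictionary()        # tuple    -> tupleID
        self.pl1 = defaultdict(list)  # tupleID  -> [(formID, count), ...]
        self.pl2 = defaultdict(list)  # formID   -> [(docID, position), ...]

    def add(self, doc_name, position, formula, tuples):
        doc_id = self.d2.id_of(doc_name)
        is_new = formula not in self.d1.ids
        form_id = self.d1.id_of(formula)
        if is_new:
            # Each distinct formula is indexed once; ids grow monotonically,
            # so every postings list stays ordered by formID.
            for tup, count in Counter(tuples).items():
                self.pl1[self.d3.id_of(tup)].append((form_id, count))
        # Record only the first position of the formula in each document.
        if all(d != doc_id for d, _ in self.pl2[form_id]):
            self.pl2[form_id].append((doc_id, position))
```

A duplicate formula costs only one PL2 entry per new document, which reflects the point that the dictionary avoids repeated processing of duplicate formulae.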


Searching: Query processing follows the architecture shown in
Figure 2. First, the query is parsed into an SLT and tuples are
extracted. Then wildcard tuples are expanded, the associated
postings lists for each tuple are found, iterators over these
lists are created, and an iterator tree that implements the
query is formed. Next, the iterator tree is advanced along
formula identifiers in order, the scores are calculated, and the
top-k formulae are stored in a heap. During this process,
non-wildcard iterators are advanced first so that wildcard
iterators only match unallocated tuples. As optimizations,
iterators may skip over some formulae based

5 The engine does not try to enforce wildcard variable
agreement between tuples (wildcard joins), and it ignores
multi-wildcard tuples. An initial implementation handling
multi-wildcard tuples and wildcard joins was found to be
approximately a hundred times slower than the current engine for
a small dataset.


on thresholds and max-score calculations (see below). After 
the iterators are finished, matching formulae and scores are 
returned along with the associated document names. 

The engine uses Dice’s coefficient over tuples as a simple
ranking algorithm, counting the number of tuples that overlap
between the query and a candidate formula using the query
iterators. The engine also stores the tuple count for each
formula in an array A1 and uses these values in the ranking
calculation:

A1: formID → tuplecount

Since wildcards can often match multiple tuples in a query and
overlap with other wildcards, there could be multiple ways to
count the tuples that overlap. The engine implements a greedy
counting approach by simply assigning the matches for tuples
when each of the iterators is advanced.
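For queries without wildcards, the first-stage score can be sketched as a Dice coefficient over tuple bags. The function below is a minimal illustration; treating the bags as multisets (so that repeated tuples count more than once) is an assumption, and the greedy wildcard assignment described above is not modeled.

```python
from collections import Counter


def dice(query_tuples, candidate_tuples):
    """Dice's coefficient over two bags of symbol pair tuples.

    Assumes no wildcard tuples; overlap is the multiset intersection.
    """
    q, c = Counter(query_tuples), Counter(candidate_tuples)
    overlap = sum((q & c).values())
    total = sum(q.values()) + sum(c.values())
    return 2 * overlap / total if total else 0.0


# Hypothetical tuples: the candidate shares 2 of the query's 4 tuples
# and has 1 tuple of its own.
query = [("V!x", "+", "n"), ("+", "V!y", "n"),
         ("V!x", "V!y", "nn"), ("V!x", "N!2", "a")]
candidate = [("V!x", "+", "n"), ("+", "V!y", "n"), ("V!y", "N!3", "a")]
score = dice(query, candidate)  # 2 * 2 / (4 + 3)
```

The denominator uses the total tuple count of the candidate, which is the quantity the array A1 makes available without re-reading the formula.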

Parameters: The engine has three configuration parameters: the
window size w for formula-to-tuple conversion, the optional use
of end-of-line tuples, and the number of formulae k to return
for each query. The runtime efficiency and ranking effectiveness
of configurations using various settings of the first two
parameters with k = 100 are examined in Section 8.

Optimizations: By using dictionaries and postings lists, the
engine’s data structures are small enough to be run in memory
for the datasets being examined, so we do not examine additional
techniques to compress these data structures here. Nevertheless,
query processing might still be slow, even though the data
structures are in memory, the ranking algorithm is fast, and the
use of a dictionary avoids repeated processing of duplicate
formulae. As a result, various techniques are employed to
improve query execution time:

O1: Avoid processing all postings by allowing skipping in
query iterators. This functionality is implemented using
doubling (galloping) search [3].

O2: Skip formulae based on size thresholds. We use the
current top-k candidate list to define a minimum score
that defines minimum and maximum tuple size thresholds
from the definition of Dice’s coefficient. We also
improve on the effectiveness of these thresholds by
reordering formula identifiers: sort the formulae by
size, split into quartiles {q1, q2, q3, q4}, and then
reorder {q2, reverse(q1), q3, q4}.

O3: Avoid formulae that match only wildcard tuples when
the score threshold allows. This is similar to portions
of the max-score [19] optimization, only at a coarser
granularity.

O4: Avoid processing all wildcard tuple expansions. If a
tuple is matched to a wildcard for the next formula, do
not process the remaining iterators for this wildcard.

O5: Process iterators for large postings lists first.
Evaluate the binary operator tree left-first and order tree
operators descending by size when possible.
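The first two optimizations can be sketched as follows. Both functions are illustrative assumptions rather than the engine's code: `gallop_to` is a textbook doubling search over a sorted postings list, and `dice_size_bounds` derives size thresholds from the observation that the overlap between a query with |q| tuples and a candidate with |c| tuples is at most min(|q|, |c|), so a Dice score of at least s requires s|q|/(2 - s) <= |c| <= |q|(2 - s)/s. Whether the engine compares against the minimum score strictly or non-strictly is not stated, so the non-strict form is assumed here.

```python
from bisect import bisect_left
from math import ceil, floor


def gallop_to(postings, start, target):
    """Return the first index i >= start with postings[i] >= target.

    Probes at doubling distances from `start`, then binary-searches
    the last bracket (doubling/galloping search).
    """
    n = len(postings)
    if start >= n or postings[start] >= target:
        return start
    bound = 1
    while start + bound < n and postings[start + bound] < target:
        bound *= 2
    lo = start + bound // 2 + 1   # postings[lo - 1] < target is known
    hi = min(start + bound, n)
    return bisect_left(postings, target, lo, hi)


def dice_size_bounds(query_size, min_score):
    """Smallest and largest candidate tuple counts that can reach min_score.

    Uses floating-point arithmetic, so values exactly on a boundary
    may be off by one; exact integer arithmetic would avoid that.
    """
    if min_score <= 0:
        return 0, float("inf")
    smallest = ceil(min_score * query_size / (2 - min_score))
    largest = floor(query_size * (2 - min_score) / min_score)
    return smallest, largest
```

For a query with 10 tuples and a current minimum score of 0.5, the bounds are 4 and 30: a candidate with 3 tuples can score at most 6/13, and one with 31 tuples at most 20/41, so both can be skipped without scoring.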

Various improvements to the engine have been left for future
work, including compression of the postings lists and
implementing more of the query functionality in the engine.
Additional improvements in query runtime are also possible by
using an implementation of weak-AND [2] or a more fine-grained
implementation of max-score [19].






7. RERANKING BY MAXIMUM SUBTREE SIMILARITY

Effective information retrieval depends on ranking documents
based on their similarity to a user’s query. For example, when
using tree-based formula retrieval, one could extract a set of
features from a query tree $T_q$ and each candidate indexed tree
$T_{c_i}$, apply Dice’s coefficient of similarity as the score
for $T_{c_i}$, and rank candidates by their scores. In this
section we describe an alternative to Dice’s metric that is
particularly effective in ranking mathematical formulae.

Notation: The label on node $n$ in SLT $T$ is denoted
$\lambda(n)$. The number of nodes in SLT $T$ is denoted $|T|$.
For simplicity, we write $n \in T$ if $n$ is a node in $T$ and
$(n_1, n_2) \in T$ if $(n_1, n_2)$ is an edge in $T$.

Approximate matches of formulae might involve isolating
corresponding parts of SLTs representing a query and a candidate
match. Therefore we need a basis for describing such a
correspondence.

Definition (aligned SLTs): SLTs $T_1$ and $T_2$ are aligned if
there is an isomorphism $f$ mapping nodes from $T_1$ onto nodes
from $T_2$ such that for every edge $(n_a, n_b) \in T_1$, there
is a corresponding edge $(f(n_a), f(n_b)) \in T_2$ that has the
same label. (Note that node labels in aligned trees need not
match.) For $N$ a subset of nodes in $T_1$, we define
$f(N) = \{f(n) \mid n \in N\}$.

Approximate matches might also involve simple replacements of
symbols in one SLT by alternative symbols (e.g., x for y or 3
for 2). Naturally, a wildcard symbol can be replaced by any
symbol.

Definition (unified nodes): Node $n_1$ in SLT $T_1$ can be
unified with node $n_2$ in SLT $T_2$, denoted
$n_1 \rightarrow n_2$, if one of the following conditions holds:

• Both $n_1$ and $n_2$ have type variable name (V!),

• Both $n_1$ and $n_2$ have type number (N!),

• $n_1$ has type wildcard (?), or

• $n_1$ has a type other than variable name, number, or
wildcard and $\lambda(n_1) = \lambda(n_2)$.
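The four unification conditions translate directly into a small predicate. The sketch below is an illustration, not the re-ranker's code: `simple_type` and `can_unify` are hypothetical names, nodes are represented only by their label strings, and the type is derived from the label prefix as described in Section 4.1.

```python
def simple_type(label: str) -> str:
    """Coarse type from a label: 'wildcard', a prefix such as 'V!', or 'operator'."""
    if label.startswith("?"):
        return "wildcard"
    bang = label.find("!")
    return label[: bang + 1] if bang > 0 else "operator"


def can_unify(label1: str, label2: str) -> bool:
    """True if a node labeled label1 can be unified with one labeled label2."""
    t1, t2 = simple_type(label1), simple_type(label2)
    if t1 == "wildcard":
        return True               # a wildcard may be replaced by any symbol
    if t1 in ("V!", "N!"):
        return t2 == t1           # variable with variable, number with number
    return label1 == label2       # every other type requires identical labels
```

The relation is not symmetric under these rules: `can_unify("?w", "+")` holds, while `can_unify("+", "?w")` does not, because only the first node's wildcard type is consulted.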

However, the SLT for an arbitrary query formula will not
necessarily align with the SLT for an arbitrary candidate match
formula. Therefore, we need to consider parts of the SLTs that
can be aligned. When considering many candidate match trees, we
are most interested in those parts of the query and candidate
trees that are similar to the tree representing the whole query.

Definition (maximally similar subtree): Given SLTs $T_q$ and
$T_c$ and aligned SLTs $T_1$ and $T_2$ with isomorphism $f$ from
$T_1$ to $T_2$, where $T_1$ is a pruned subtree⁶ of $T_q$ and
$T_2$ is a pruned subtree of $T_c$, let
$m = |\{n \in T_1 \mid n \rightarrow f(n)\}|$. Let
$\frac{2m}{|T_1| + |T_q|}$ be a measure of similarity of $T_1$ to
$T_q$ with respect to $T_2$ (Dice’s measure). $T_1$ is then
maximally similar to $T_q$ if the root of $T_1$ can be unified
with the root of $T_2$

6 Given a tree T, a pruned subtree is any connected subset 
of nodes from T together with the edges connecting those 
nodes. Thus a pruned subtree is itself a tree, but it need not 
extend to the leaves of T. Henceforth, we will use “subtree” 
to mean “pruned subtree.” 


and there is no other pair of aligned SLTs $T'_1$ and $T'_2$
with corresponding measure $m'$, where $T'_1$ is a subtree of
$T_q$, $T'_2$ is a subtree of $T_c$, the root of $T'_1$ is the
same as the root of $T_1$, the root of $T'_2$ is the same as the
root of $T_2$, $|T'_1| > |T_1|$, and

$$\frac{2m'}{|T'_1| + |T_q|} \geq \frac{2m}{|T_1| + |T_q|}.$$
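As a small worked illustration of the similarity measure in this definition (the sizes below are invented for the example and do not come from the paper's figures), suppose the query tree has $|T_q| = 5$ nodes and an aligned subtree $T_1$ has 4 nodes, 3 of which unify with their images under $f$:

```latex
\[
  \frac{2m}{|T_1| + |T_q|} = \frac{2 \cdot 3}{4 + 5} = \frac{6}{9} \approx 0.667 .
\]
% A larger aligned subtree T'_1 with the same root, 5 nodes, and m' = 4:
\[
  \frac{2m'}{|T'_1| + |T_q|} = \frac{2 \cdot 4}{5 + 5} = 0.8 > 0.667 ,
\]
% so T_1 would not be maximally similar. If instead m' = 3:
\[
  \frac{2m'}{|T'_1| + |T_q|} = \frac{2 \cdot 3}{5 + 5} = 0.6 < 0.667 ,
\]
% and this particular extension does not rule out T_1.
```

In other words, growing the aligned subtree pays off only when the added nodes contribute enough unified matches to offset the larger denominator.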

Theorem 1. Given SLTs $T_q$ and $T_c$ and aligned SLTs $T_1$
and $T_2$ with isomorphism $f$ from $T_1$ to $T_2$, where $T_1$
is a subtree of $T_q$ and $T_2$ is a subtree of $T_c$,
determining that $T_1$ is maximally similar to $T_q$ can be
performed in time $O(|T_q|)$.

Proof. Let $r$ be the root of $T_1$. The maximally similar
subtree to $T_q$ and rooted at $r$ can be determined in time
$O(|T_q|)$:

base case: If $r$ is a leaf of $T_q$, then $|T_1| = |T_2| = 1$
and $T_1$ is maximally similar to $T_q$ iff
$r \rightarrow f(r)$. This can be determined in $O(1)$ time.

recursion: Let $\mathcal{T} = \{t_i \mid t_i$ is a maximally
similar subtree of $T_q$, the root of $t_i$ is a child of $r$,
the root of the aligned SLT for $t_i$ is a child of $f(r)$, and
the label on the edge from $r$ to the root of $t_i$ is the same
as the label on the edge from $f(r)$ to the root of the aligned
SLT for $t_i\}$. (This construction is unambiguous because the
edge labels on all edges starting at a node are unique.) The
isomorphism $f$ can then be extended to include all the nodes in
all subtrees in $\mathcal{T}$, and the subtree of $T_q$
consisting of $r$ and all the subtrees in $\mathcal{T}$ will be
aligned with the subtree of $T_c$ having nodes
$\{f(n) \mid n = r \lor (n \in t_i \land t_i \in \mathcal{T})\}$.
Let $m_i = |\{n \in t_i \mid n \rightarrow f(n)\}|$.
The maximally similar subtree to T q and rooted at r then 
includes {n \ n = r V 3t, G T( | t .|_^| T > A n G U)} 

iff $r \rightarrow f(r)$ (i.e., we can evaluate the similarity
for each subtree independently to determine whether or not it is
part of the maximally similar subtree). Because the outdegree of
$r$ is bounded by a constant, each step of the recursion can be
performed in $O(1)$ time.

$T_1$ is then maximally similar to $T_q$ iff it is the tree
thus constructed and $r \rightarrow f(r)$, and the construction
can be performed in time $O(|T_q|)$. □

Next, when matching with substituted symbols, it is important
that the substitutions are consistent when determining
that two formulae match approximately.

Definition (alignment partition): Given $T_1$ and $T_2$, two
aligned SLTs with isomorphism $f$ from $T_1$ to $T_2$, an alignment
partition is a subset of nodes $N$ in $T_1$ such that
$(x \in N \land y \in N) \Rightarrow (\lambda(x) = \lambda(y) \land
\lambda(f(x)) = \lambda(f(y)) \land x \to f(x))$.
For node $n \in T_1$, we define $P(n)$ to be the alignment
partition containing $n$ if it exists and $\emptyset$ otherwise. (Note
that $n \in P(n) \iff n \to f(n)$.) For alignment partition $A$,
$\lambda(A)$ denotes the label that is common to all nodes in $A$
and $\lambda(f(A))$ denotes the label that is common to all nodes
in $f(A)$.

Definition (matched set of nodes): Given aligned SLTs
$T_1$ and $T_2$ with isomorphism $f$ from $T_1$ to $T_2$ and the set of
all corresponding alignment partitions, we define a matched
set of nodes $M$ as

$$M = \{n \in T_1 \mid n \in P(n) \land \forall n' \in M\,
([\lambda(n') = \lambda(n) \lor \lambda(f(n')) = \lambda(f(n))]
\Rightarrow n' \in P(n))\}$$

In preparation for preferring matches of large connected parts
of SLTs, let $E(M) = \{(n_1, n_2) \mid n_1 \in M \land n_2 \in M \land
(n_1, n_2) \in T_1\}$.
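
The two definitions above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: nodes are integer identifiers, `label1` and `label2` map the nodes of the two aligned SLTs to their symbols, `f` is the isomorphism stored as a dict, and `unifiable` is a hypothetical stand-in for the unification test. The function names are invented for this example, and the sketch takes each alignment partition to be the largest group of nodes sharing a (query label, candidate label) pair.

```python
from collections import defaultdict

def alignment_partitions(nodes1, f, label1, label2, unifiable):
    """Group nodes of T1 by the (query label, candidate label) of each aligned pair."""
    parts = defaultdict(set)
    for n in nodes1:
        a, b = label1[n], label2[f[n]]
        if unifiable(a, b):  # only unifiable pairs belong to a partition
            parts[(a, b)].add(n)
    return dict(parts)

def matched_edges(edges1, M):
    """E(M): edges of T1 whose two endpoints are both in the matched set M."""
    return {(u, v) for (u, v) in edges1 if u in M and v in M}

# Toy example: query "x + x" aligned with candidate "y + y".
label1 = {0: "x", 1: "+", 2: "x"}
label2 = {10: "y", 11: "+", 12: "y"}
f = {0: 10, 1: 11, 2: 12}
# Assumed unification rule for the example: equal symbols, or two letters.
unify = lambda a, b: a == b or (a.isalpha() and b.isalpha())
parts = alignment_partitions([0, 1, 2], f, label1, label2, unify)
```

In the toy example the two occurrences of x fall into one partition because both are aligned with y, which is what keeps substitutions consistent.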








[Figure 4: five candidate formulae for the query S(k), shown in
decreasing rank order with score triples (1, 0, 3), (1, 0, 2),
(1, -1, 2), (0.6, 0, 2), and (0.6, -1, 2).]

Figure 4: Maximum Subtree Similarity Scoring for
Query S(k).

We need to accommodate situations in which symbol x in
the query formula is replaced by symbol y in some parts of a
candidate matching formula and by other symbols elsewhere,
and where superfluous instances of x or y might appear in
the candidate match. We suggest the following properties
for a scoring function, as illustrated in Figure 4: alignments
with more matched symbols in close proximity to each other
score higher than those with fewer matched symbols or more
disconnected matches; if two candidates score equally with
respect to matched symbols and their proximity, the one
with fewer superfluous symbols scores higher; and everything
else being equal, alignments with identical node labels
score higher than alignments with distinct node labels that
can be unified. Tangent uses such a scoring function:

Definition (SLT score): Given a query SLT $T_q$, an SLT
$T_c$ for a candidate match, and two aligned SLTs $T_1$ and $T_2$
where $T_1$ is a subtree of $T_q$ and $T_2$ is a subtree of $T_c$, let $M$ be
a matched set of nodes for $T_1$ and $T_2$. The score of $T_c$ with
respect to $T_q$, $T_1$, $T_2$, and $M$ is denoted $s(T_q, T_c; T_1, T_2, M)$
and defined as the triple composed of the following parts:

1. the harmonic mean of the fraction of nodes from $T_q$
preserved by $M$ and the fraction of edges preserved
by $E(M)$, i.e.,
$\frac{2}{\frac{|T_q|}{|M|} + \frac{|T_q| - 1}{\max(|E(M)|,\, 0.5)}}$
if $|M| > 0$, otherwise 0.

2. the negation of the number of unmatched nodes in $T_c$,
i.e., $|M| - |T_c|$.

3. the number of nodes that match exactly, i.e.,
$|\{n \in M \mid \lambda(n) = \lambda(f(n))\}|$.

The scores (triples) assigned to any two candidate matches
can be computed in $O(1)$ time if $T_1$, $T_2$, and $M$ are given,
and they can be compared lexicographically to determine
which candidate ranks higher.
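
As a rough sketch of how such a triple might be computed and compared, the Python below follows the three parts of the definition. Several things are assumptions of the sketch rather than statements about Tangent: the function name and arguments are hypothetical, the query is assumed to have at least two nodes, the number of query edges is taken to be the node count minus one (as in any tree), and the edge count is clamped at 0.5 to avoid division by zero.

```python
def slt_score(n_query_nodes, n_candidate_nodes, M, E_M, label1, label2, f):
    """Return the (harmonic mean, -unmatched, exact matches) score triple.

    Assumes n_query_nodes >= 2; all names here are illustrative only.
    """
    if not M:
        harmonic = 0.0
    else:
        node_term = n_query_nodes / len(M)                     # 1 / node fraction
        edge_term = (n_query_nodes - 1) / max(len(E_M), 0.5)   # 1 / edge fraction
        harmonic = 2.0 / (node_term + edge_term)
    unmatched = len(M) - n_candidate_nodes                     # zero or negative
    exact = sum(1 for n in M if label1[n] == label2[f[n]])
    return (harmonic, unmatched, exact)

# Python tuples already compare lexicographically, so ranking is a plain sort.
triples = [(0.6, 0, 2), (1, -1, 2), (1, 0, 3), (0.6, -1, 2), (1, 0, 2)]
ranked = sorted(triples, reverse=True)
```

Sorting the five triples listed in Figure 4 this way returns them in the order in which the figure presents them.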

For aligned SLTs $T_1$ and $T_2$ with isomorphism $f$ from $T_1$
to $T_2$ and the set of all corresponding alignment partitions,
we would like to choose a matched set of nodes $M$ that
produces a high score, but evaluating all matched sets induced
by an alignment is too expensive. Therefore we use
a greedy algorithm to select which partitions to include in
the matched set of nodes, based on the properties we use for
scoring:

1. Let $A_0$ be the alignment partition that contains the
most nodes; or if more than one partition has the most
nodes, then let $A_0$ be one of those partitions for which
$\lambda(A_0) = \lambda(f(A_0))$ if it exists; otherwise let $A_0$ be any
of the largest alignment partitions. Initialize $M$ to
include all nodes in $A_0$.

2. Repeatedly identify the largest alignment partition $A_i$
such that $\lambda(A_i)$ is not the label of any node in $M$
and $\lambda(f(A_i))$ is not the label of any node unified with
a node in $M$, choosing $A_i$ to be one where $\lambda(A_i) =
\lambda(f(A_i))$ if it exists; replace $M$ by $M \cup A_i$.

3. Stop when no more alignment partitions can be
included in $M$.


If hash tables are used to record which node labels have
been included in $M$ and in $f(M)$, checking for duplicate
labels can be performed in $O(1)$ time. Partitions can be
considered one by one in decreasing order of size, which requires
$O(|T_q| \log(|T_q|))$ time to initialize and then $O(|T_q|)$ to
enumerate since the number of partitions cannot exceed the
number of nodes in $T_q$.
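
The greedy selection can be sketched in a few lines of Python. This is a minimal illustration, not Tangent's implementation: the function name is hypothetical, and partitions are assumed to be given as a dict keyed by the (query label, candidate label) pair that their nodes share. One sort puts the partitions in decreasing order of size, preferring identical labels among partitions of equal size, and two sets play the role of the hash tables mentioned above.

```python
def greedy_matched_set(partitions):
    """partitions: dict mapping (query_label, candidate_label) -> set of T1 nodes."""
    # Largest partitions first; among equal sizes, identical labels come first.
    order = sorted(partitions.items(),
                   key=lambda kv: (-len(kv[1]), kv[0][0] != kv[0][1]))
    used_query, used_cand = set(), set()  # labels already committed to M and f(M)
    M = set()
    for (a, b), nodes in order:
        if a in used_query or b in used_cand:
            continue  # would make the substitution inconsistent
        used_query.add(a)
        used_cand.add(b)
        M |= nodes
    return M

parts = {("x", "y"): {0, 2, 4}, ("x", "z"): {6},
         ("+", "+"): {1, 3}, ("a", "y"): {5}}
M = greedy_matched_set(parts)  # x->z and a->y conflict with x->y and are skipped
```

Because the label constraints only grow as partitions are added, a single pass over the sorted list selects the same partitions as repeatedly searching for the largest admissible one.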

Finally, to compare a query SLT T q against a candidate 
SLT T c , we choose a pair of aligned subtrees that maximizes 
the score for the candidate with respect to the query. 

Definition (Maximum Subtree Similarity): Given SLTs
$T_q$ and $T_c$, consider pairs of aligned subtrees $T_1^i$ and $T_2^i$ as
follows.

• The root of $T_1^i$ can be unified with the root of $T_2^i$.

• $T_1^i$ is maximally similar to $T_q$.

The Maximum Subtree Similarity score $MSS(T_q, T_c)$ of $T_c$
with respect to $T_q$ is $\max_i s(T_q, T_c; T_1^i, T_2^i, M_i)$ over all such
pairs, where $M_i$ is a greedily chosen matched set of nodes.

Theorem 2. Computing Maximum Subtree Similarity for
a candidate formula requires time $O(|T_c||T_q|^2 \log(|T_q|))$.

Proof. The number of pairs of aligned subtrees is at
most $|T_q| \cdot |T_c|$. For each pair, checking whether the roots can
be unified requires $O(1)$ time, checking maximum similarity
requires $O(|T_q|)$ time, and computing the score requires constant
time plus time $O(|T_q| \log(|T_q|))$ to choose $M$. □
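
Spelled out, the bound in Theorem 2 is the number of candidate pairs multiplied by the per-pair cost, in which the greedy choice of the matched set dominates:

```latex
\underbrace{|T_q|\,|T_c|}_{\text{pairs of roots}} \cdot
\Bigl( \underbrace{O(1)}_{\text{unify roots}}
     + \underbrace{O(|T_q|)}_{\text{maximal similarity}}
     + \underbrace{O(|T_q| \log |T_q|)}_{\text{choose } M} \Bigr)
= O\bigl(|T_c|\,|T_q|^{2} \log |T_q|\bigr)
```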

We show experimentally that this similarity metric per¬ 
forms very well. 

8. EVALUATION 

In this section we present experiments designed to observe
the effect of system parameters on index size, retrieval
time, and search results. We do this using a combination
of benchmarks and a human experiment to evaluate the
similarity of the Top-10 formulae returned by our system to
query expressions.

Computational Resources. We use an Ubuntu Linux
12.04.5 server with 24 Intel Xeon processors (2.93 GHz) and
96 GB of RAM. While some indexing operations were parallelized
(as noted below), all retrieval times are reported for
a single process.

8.1 NTCIR-11 Formula Retrieval Benchmark 

The NTCIR-11 Wikipedia benchmark [16] is 2.5 GB, with
30,000 articles containing roughly 387,947 unique LaTeX
expressions. The benchmark includes 100 queries for measuring
specific-item retrieval performance, where each query is
associated with a single target formula in a specific document.
One or more wildcard symbols are present in 35 of the
queries. Easy queries (41) and Frequent queries (24) have no
wildcard symbols, and are distinguished by whether one or
multiple formulae in the corpus match the target expression.
Variable queries (27) and Hard queries (8) contain wildcards,
and are again distinguished by whether one or multiple
formulae match the target. Search results are returned as a
ranked list of (documentId, formulaId) pairs.

Systems are evaluated using two metrics: first, the
percentage of targets located (at any rank), and second,
the Mean Reciprocal Rank (mrr) of successfully retrieved
targets. The ‘document-centric’ evaluation filters results so




Table 1: NTCIR-11 Wikipedia Formula Retrieval Benchmark Results (100 Queries). For each system the
top row shows % recall (over all hits, top-∞), and the bottom row shows the Mean Reciprocal Rank (mrr,
in %). For Tangent-3, mrr for any formula identical to the target is also shown.

                            Document-Centric            |           Formula-Centric
Participant        Total  Easy  Frequent  Variable  Hard | Total  Easy  Frequent  Variable  Hard
TUW Vienna            97   100       100        93    88 |    93   100        96        89    63
  (mrr)               82    97        50        96    54 |    88    96        72        94    71
NII Japan             97    98       100        93   100 |    94    98        96        89    88
  (mrr)               76    99        49        82    67 |    77    89        92        78    48
Tangent-2 (RIT)       88    98        79        89    63 |    78    95        50        81    63
  (mrr)               80    96        31        92    83 |    86    94        47        96    83

Tangent-3: Using exact formula location on target document
w=1, EOL             100   100       100       100   100 |    89    95        67       100    88
  (mrr)               83   100        55        95    41 |    85   100        58        93    32
w=1, No-EOL           98    98        96       100   100 |    87    93        63       100    88
  (mrr)               82   100        56        94    31 |    84   100        59        92    32

Tangent-3: Matching equivalent formulae on target document
w=1, EOL                                                 |   100   100       100       100   100
  (mrr)                                                  |    82   100        55        93    28
w=1, No-EOL                                              |    98    98        96       100   100
  (mrr)                                                  |    82   100        56        92    28


that document identifiers are listed in their order of appearance
in the results. The ‘formula-centric’ results are computed
using the complete ranked list of matches. Previously,
‘formula-centric’ results were computed using specific formula
identifiers, so for example, given a query and target
formula, if the target formula is found at a different location
within the target document, this is considered a miss.
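
The two metrics and the document-centric filtering can be sketched as follows. This is an illustrative Python sketch rather than the benchmark's official evaluation script: the function names are hypothetical, results are assumed to be lists of (document id, formula id) pairs, and mrr is averaged over the successfully retrieved targets only, as described above.

```python
def document_centric(results):
    """Keep each document id once, in order of first appearance in the results."""
    seen, docs = set(), []
    for doc_id, _formula_id in results:
        if doc_id not in seen:
            seen.add(doc_id)
            docs.append(doc_id)
    return docs

def recall_and_mrr(ranked_lists, targets):
    """ranked_lists[i] is the ranked list for query i; targets[i] is the item sought.

    Returns (fraction of targets located, mean reciprocal rank of located targets).
    """
    reciprocal_ranks = []
    for ranked, target in zip(ranked_lists, targets):
        if target in ranked:
            reciprocal_ranks.append(1.0 / (ranked.index(target) + 1))
    found = len(reciprocal_ranks)
    recall = found / len(targets)
    mrr = sum(reciprocal_ranks) / found if found else 0.0
    return recall, mrr

# Two toy queries: the first target is in document d2, the second is never found.
q1 = [("d1", "f1"), ("d1", "f2"), ("d2", "f1")]
q2 = [("d3", "f1")]
doc_scores = recall_and_mrr([document_centric(q1), document_centric(q2)], ["d2", "d9"])
formula_scores = recall_and_mrr([q1, q2], [("d2", "f1"), ("d9", "f1")])
```

In the toy data the target document sits at rank 2 once duplicates of d1 are filtered out, but the target formula sits at rank 3 in the full list, so the document-centric mrr is higher than the formula-centric one.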

At the top of Table 1, NTCIR-11 results from the two
best systems and from Tangent version 2 are shown as fractions
rounded to the nearest percentage [16]. The results
for Tangent-3 are shown below these in Table 1. For the
formula-centric evaluation, we present results when defining
hits using specific formula identifiers, and when accepting
any identical formula on the target document. The large
difference between these two definitions of hits indicates that
our system should index all occurrences of formulae in documents,
rather than just the first unique occurrence of each
formula in a document as done currently.

For Tangent-3, all combinations of window sizes w ∈
{1, 2, 3, 4, ∞} (where ∞ is all tuples) and including or excluding
end-of-line symbols (EOL/No-EOL) were used. Surprisingly,
different window sizes had very little effect on performance,
although adding EOL tuples increased the number
of formulae retrieved by two (e.g. the query ‘s’ could be
located). We show results for window size 1 (w = 1)
with and without end-of-line symbols in Table 1.

Tangent-3 obtains the best document recall and mrr results
to date, with perfect recall when using EOL tuples.
The slightly lower Variable and Hard query mrr values are an
artifact of additional challenging formulae being located at
lower ranks. For formula evaluation, the formula recall is improved
over Tangent-2 when using exact formula id matches
for hits, and when we treat equivalent formulae on a document
as a match, we again obtain perfect recall using EOL
tuples. As before, the mrr is slightly lower than in some
other results because additional formulae have been found
which are located at relatively low ranks.

From a user’s perspective, after formulae are grouped according
to their SLT matches (see Figure 1), 89 of the target
formulae appear in the first (Top-1) group, and 95 within
the Top-3 groups. If EOL symbols are added, then 97 of the
queries are found in the Top-3 formula groups.


8.2 Indexing and Retrieval 

We used both NTCIR-11 collections to test the index 
sizes and retrieval times for our system. In addition to 
the Wikipedia collection described above, we use the much 
larger NTCIR-11 arXiv collection to test the scalability of 
Tangent. The arXiv collection is 174 GB uncompressed, 
with 8,301,578 documents (fragments from arXiv articles) 
and roughly 60 million formulae including isolated symbols. 

Indexing Time. Preprocessing the arXiv documents took
43 hours (using 10 processes), and generating the index took
at most an additional 3.5 hours using a single
process (all tuples, with end-of-line tuples). Wikipedia was
much faster, requiring 260 seconds for preprocessing, and
at most 95 seconds for index creation. As our document
pre-processor and re-ranker are implemented in Python, we
believe that a faster implementation (e.g., in C++) could
reduce run times by a factor of 4-10 in both cases.

Index Sizes. As seen in Table 2, when all tuples are
stored with EOL tuples, the Wikipedia index is 499 MB
on disk; in contrast, for the arXiv this maximum index
size is 29 GB. When these index files are loaded into memory,
they consume 2 to 2.5 times their space on disk. Index
size increases roughly linearly from window sizes 1-4, with
end-of-line tuples increasing storage by a constant amount.
For smaller window sizes storage is much smaller; for w =
1 without end-of-line tuples the index file is 63 MB for
Wikipedia, and 5.2 GB for arXiv; these are much smaller
than Tangent-2 (1.3 GB for Wikipedia, and roughly 36 GB
for the arXiv dataset [14]).
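
The roughly linear growth and the constant cost of end-of-line tuples can be checked directly from the Wikipedia figures in Table 2. The short Python snippet below is only a worked arithmetic check; the variable names are chosen for the example.

```python
# Wikipedia index sizes in MB from Table 2, keyed by window size w.
no_eol = {1: 63.1, 2: 94.4, 3: 126.9, 4: 159.7}
eol = {1: 72.6, 2: 103.9, 3: 136.4, 4: 169.1}

# Growth in index size for each unit increase of w (without EOL tuples).
steps = [round(no_eol[w + 1] - no_eol[w], 1) for w in (1, 2, 3)]

# Extra space taken by end-of-line tuples at each window size.
eol_cost = [round(eol[w] - no_eol[w], 1) for w in (1, 2, 3, 4)]

print(steps)     # [31.3, 32.5, 32.8]
print(eol_cost)  # [9.5, 9.5, 9.5, 9.4]
```

Each additional unit of window size adds about 31 to 33 MB, while end-of-line tuples add about 9.5 MB regardless of w.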

Retrieval and Reranking Times. We ran all 100 of
the NTCIR-11 Wikipedia queries over the arXiv collection
to test retrieval speed. Retrieval is now much faster than
Tangent-2 (see Section 3). We see in Figure 5(b) that median
retrieval times are less than 1 second in all conditions, but
much faster without end-of-line symbols. For the smaller
Wikipedia collection retrieval is much faster, with average
retrieval time without EOL at w = 1 being 7 ms (σ = 23 ms)
and 9 ms at w = 2 (σ = 31 ms).

Re-ranking times are consistent across core parameters
because the number of candidates reranked is fixed at k =
100. For Wikipedia, re-rank times are (median, μ, σ) =
(72, 775, 3562) ms. The mean is skewed significantly by a
























Table 2: Index Sizes for NTCIR-11 Collections

Index Sizes (MB)
           Wikipedia             arXiv
w       No-EOL     EOL      No-EOL      EOL
1         63.1    72.6       5,238    6,036
2         94.4   103.9       7,419    8,216
3        126.9   136.4       9,491   10,288
4        159.7   169.1      11,397   12,194
∞        489.2   498.7      28,099   28,897

[Figure 5: (a) MSS Distributions (nDCG); (b) Retrieval times (ms).]

Figure 5: Distribution of Top-100 nDCG (MSS)
Scores and Wikipedia Query Retrieval Times for the
NTCIR-11 arXiv Collection.

small number of outliers. In one extreme case, re-ranking
took 46 seconds, while retrieval from the core (w = ∞ with
EOL) was just 1.7 seconds. This particular expression is
very large, with 16 wildcard symbols (Query 52), producing
large hits with many possible unifications. We do not expect
to see many queries of this type in common use.

8.3 Evaluation of MSS Reranking 

We now consider how well MSS-based rankings correspond
to human perceptions of formula similarity, through evaluating
Top-10 results.

To first select which combinations of parameters to use for
our human evaluation, we examined the MSS scores of formulae
returned by the core before re-ranking. In Figure 5(a),
from the Top-100 hits returned by the core for the NTCIR-11
Wikipedia task, we compute normalized Discounted Cumulative
Gain (nDCG@100) distributions for the Maximum
Subtree Similarity Scores in each of the Top-100 hits compared
to an MSS ‘gold standard.’ In the gold standard all
formulae in the Wikipedia collection have been scored for
each of the 100 Wikipedia queries, and the top-100 formulae
for each query are used for normalization. We again consider
a number of different window size and EOL parameters
(w ∈ {1, 2, 3, 4, ∞}, with and without EOL tuples). The
first five columns show increasing w values without EOL
tuples, followed by the same range of w values with EOL
tuples.
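
For readers unfamiliar with the measure, a minimal nDCG@k computation looks like the Python below. This is a sketch under stated assumptions, not the evaluation code used for Figure 5: it uses the common log2 rank discount, and it assumes each hit has been given a single numeric gain, whereas how an MSS triple is mapped to such a gain is left open here. The function names are hypothetical.

```python
import math

def dcg(gains, k):
    """Discounted cumulative gain of the first k gains (rank 1 is undiscounted)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(core_gains, gold_gains, k=100):
    """core_gains: gains of the hits in the order returned by the core.
    gold_gains: gains of the best-scoring formulae in the whole collection."""
    ideal = dcg(sorted(gold_gains, reverse=True), k)
    return dcg(core_gains, k) / ideal if ideal > 0 else 0.0

perfect = ndcg([1.0, 0.5], [1.0, 0.5], k=2)   # core order equals the gold order
swapped = ndcg([0.5, 1.0], [1.0, 0.5], k=2)   # best hit returned second
```

A core that returns the gold-standard formulae in gold-standard order scores 1; returning them in a worse order, or missing some of them, lowers the value.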

Adding EOL tuples shifts values around the median down.
We took this as evidence that including EOL tuples was
not helping return more similar formulae as measured by
nDCG over the MSS scores. Further, as moving from w = 1
to w = 2 reduces the variance most dramatically, to keep
the number of hits for individual participants to evaluate
reasonable, we chose to consider only w = 1, 2, and ∞.

8.3.1 Experimental Design 


Table 3: Mean and Standard Deviation Likert Ratings
for Top-10 NTCIR-11 Wikipedia Hits (21 Participants,
10 queries).

Rank/Position in Top-10 Hits
         1            2            3            4            5
w = 1    4.54 (0.78)  3.79 (1.16)  3.48 (1.31)  3.20 (1.30)  2.83 (1.24)
w = 2    4.54 (0.78)  3.71 (1.22)  3.48 (1.30)  3.16 (1.28)  2.90 (1.25)
w = ∞    4.54 (0.78)  3.78 (1.18)  3.59 (1.22)  3.27 (1.15)  2.98 (1.24)

         6            7            8            9            10
w = 1    2.94 (1.22)  2.65 (1.19)  2.78 (1.20)  2.78 (1.20)  2.85 (1.24)
w = 2    2.93 (1.19)  2.85 (1.25)  2.57 (1.18)  2.74 (1.22)  2.80 (1.13)
w = ∞    2.92 (1.23)  2.80 (1.17)  2.98 (1.23)  2.92 (1.21)  2.87 (1.17)


Data. A set of 10 queries was selected using random
sampling from the Wikipedia query set. Five of the queries
contained wildcards, and the other five did not. Some queries
were manually rejected and then randomly replaced to ensure
that a diverse set of expression sizes and structures was
collected. Using the Wikipedia collection, for the three versions
of the core compared (w ∈ {1, 2, ∞}, no EOL tuples),
we applied reranking to the Top-100 hits, and then collected
the Top-10 hits returned by each query for rating.

Evaluation Protocol. Participants completed the study
alone in a private, quiet room with a desktop computer running
the evaluation interface in a web browser. The web
pages first provided an overview, followed by a demographic
questionnaire, instructions on evaluating hits, and then
familiarization trials (10 hits; 5 for each of two queries). After
familiarization participants evaluated hits for the 10 queries,
and finally completed a brief exit questionnaire. Participants
were paid $10 at the end of their session.

Participants rated the similarity of queries to results using
a five-point Likert scale (Very Dissimilar, Dissimilar, Neutral,
Similar, Very Similar). It has been shown that presenting
search results in an ordered list affects the likelihood of
hits being identified as relevant [6]. Instead we presented
queries along with each hit in isolation. To avoid other
presentation order effects, the order of query presentation was
randomized, followed by the order in which hits for each
query were presented.
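
The two-level randomization described above can be sketched as follows. This is a hypothetical illustration, not the study's actual interface code: the function name, the dict-of-lists input, and the optional seed are all assumptions made for the example.

```python
import random

def presentation_order(hits_by_query, seed=None):
    """Shuffle the queries, then shuffle the hits within each query.

    Returns a flat list of (query, hit) pairs, one pair per screen shown.
    """
    rng = random.Random(seed)
    queries = list(hits_by_query)
    rng.shuffle(queries)                 # randomize query order first
    order = []
    for q in queries:
        hits = list(hits_by_query[q])
        rng.shuffle(hits)                # then randomize hits for that query
        order.extend((q, h) for h in hits)
    return order

order = presentation_order({"q1": ["a", "b", "c"], "q2": ["d", "e"]}, seed=7)
```

Each (query, hit) pair is rated on its own, so a participant never sees the rank at which the system returned the hit.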

8.3.2 Results 

Demographics and Exit Questionnaire. 21 participants
(5 female, 16 male) were recruited from the Computing
and Science colleges at our institution. Their age
distribution was: 18-24 (8), 25-34 (9), 35-44 (1), 45-54 (1),
55-64 (1) and 65-74 (1). Their highest levels of education
completed were: Bachelor’s degree (9), Master’s degree (9),
PhD (2), and Professional Degree (1). Their reported areas
of specialization were: Computer Science (13), Electrical
Engineering (2), Psychology (1), Sociology (1), Mechanical
Engineering (1), Computer Engineering (1), Math (1) and
Professional Studies (1).

In the post-questionnaire, participants rated the evaluation
task as Very Difficult (3), Somewhat Difficult (10), Neutral
(6), Somewhat Easy (2) or Very Easy (0). They reported
different approaches to assessing similarity. Many considered
whether operations and operands were of the same type
or if two expressions would evaluate to the same result. Others
reported considering similarity primarily based on similar
symbols, and shared structure between expressions.

Similarity Ratings. As seen in Table 3, the Likert-based
similarity rating distributions are very similar, and identical
in a number of places. In all three conditions, average
ratings increase consistently from the 5th to 1st hits. The








































































top 4 formula hits all have an average rating higher than ‘3,’
suggesting that a number of participants felt these formulae
had some similarity with the query expression. After this
the ratings are less than ‘3’ and sometimes shift. Perhaps
because matches were not highlighted, in a number of cases
exact matches were rated as ‘4’ rather than ‘5.’ As was
found for the NTCIR-11 Wikipedia benchmark, it appears
that a window size of 1 is able to obtain strong results. This
is appealing, because this requires the smallest index size
and has the fastest retrieval times.

9. CONCLUSION 

We have presented a new technique for ranking appearance-based
formula retrieval results, using the candidate formula
subtree with the harmonic mean for matched symbols and
edges after greedy unification of symbols by type. This Maximum
Subtree Similarity (MSS) metric prefers large connected
matches of the query within the formula. In an
experiment we found that for the Top-10 hits, the human
ratings of similarity were consistent with the ranking produced
by our metric. We have also described an efficient
two-stage implementation of our retrieval model that produces
state-of-the-art results for the NTCIR-11 Wikipedia
formula retrieval task, using a much smaller index.

In the future we plan to explore using end-of-line symbols,
but only for small expressions. This will not require much
additional space in the index, while greatly reducing the cost
of wildcard end-of-line tuples. We also plan to support multiple
copies of a formula in a document, devise new methods
for ranking documents based on multiple matches and/or
query expressions, and integrate our formula retrieval system
with keyword search.